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ABSTRACT 

Motivation: The prediction of RNA 3D structures from its sequence 
only is a milestone to RNA function analysis and prediction. In 
recent years, many methods addressed this challenge, ranging from 
cycle decomposition and fragment assembly to molecular dynamics 
simulations. However, their predictions remain fragile and limited to 
small RNAs. To expand the range and accuracy of these techniques, 
we need to develop algorithms that will enable to use all the structural 
information available. In particular, the energetic contribution of 
secondary structure interactions is now well documented, but 
the quantification of non-canonical interactions — those shaping 
the tertiary structure — is poorly understood. Nonetheless, even 
if a complete RNA tertiary structure energy model is currently 
unavailable, we now have catalogues of local 3D structural motifs 
including non-canonical base pairings. A practical objective is thus 
to develop techniques enabling us to use this knowledge for robust 
RNA tertiary structure predictors. 

Results: In this work, we introduce rna-moip, a program that 
benefits from the progresses made over the last 30 years in 
the field of RNA secondary structure prediction and expands 
these methods to incorporate the novel local motif information 
available in databases. Using an integer programming framework, 
our method refines predicted secondary structures (i.e. removes 
incorrect canonical base pairs) to accommodate the insertion of RNA 
3D motifs (i.e. hairpins, internal loops and /c-way junctions). Then, 
we use predictions as templates to generate complete 3D structures 
with the MC-Sym program. We benchmarked rna-moip on a set 
of 9 RNAs with sizes varying from 53 to 128 nucleotides. We show 
that our approach (i) improves the accuracy of canonical base pair 
predictions; (ii) identifies the best secondary structures in a pool of 
suboptimal structures; and (iii) predicts accurate 3D structures of 
large RNA molecules. 

Availability: rna-moip is publicly available at: http:// 

csb.cs.mcgill.ca/RNAMolP. 

Contact: jeromew@cs.mcgill.ca 



1 INTRODUCTION 

Ribonucleic acids perform a broad range of functions in cells. 
Ribozymes such as the RNase P or the group II introns catalyze 
chemical reactions, whereas microRNAs hybridize to messenger 
RNA to regulate gene expression. To achieve these functions, 
many RNAs fold into specific 3D structures that are directly 



encoded in their nucleotide sequence. The structural information is 
therefore useful to predict the function. Nonetheless, experimental 
determination of RNA structures remains time-consuming and 
technically challenging. It follows that we need to develop fast and 
reliable computational tools to help to predict them. 

During the last few years, several groups have developed fully 
automated RNA 3D stru cture prediction progra ms. To date, the most 
popular ones are FAR NA ( Das and BakeAboOTj), the MC - P ipel i ne 
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(IParisien and Maioii l2008h iFoldRNA dSharma et a/.L l2008h and 
NAST Jjonikas et fl?ll2009l) . A recent review bv lLaing and Schlickl 
d2010h proposes a comprehensive overview of these strategies. 
Conditional random fields techniques im plemented in BARNA CLE 
dFrellsen gfo/ll2009h and TreeFolder JWang and Xull201ll) also 
appear as a promising approach. 

However, unlike cla ssical secondary struct u re pre dictors such 
as RNAs true tu re dReuter and Mathewsl l20ld). R N A fold 
(iHofackeil 120091). unafold jMarkham and Zukeii 120081) 
cont raf old iDo et q/.Ll2006h or context fold (IZakov et all 
l201ll) , the range of application of the RNA 3D structure predictors 
is limited. Currently, their time requirement and/or their accuracy 
restrict their application range to sequences with <50 nucleotides. 
By contrast, secondary structure predictors are fast and reliable on 
seque nces with >100 nucleotides. MC-Fold dParisien and Maioii 
120081) and RNAwolf izu Siederdissen et a/ll201lh expanded these 
techniques to predict extended secondary structures (i.e. including 
non-canonical interactions), but these algorithms remain limited 
to predict nested secondary structures without £-way junctions, 
thus, precisely lacking the structural motifs shaping the RNA 
3D structure. 

Thus, ab-initio 3D structure prediction of large RNA molecules 
(i.e. >50 nucleotides in our context) is still an open question. To 
overcome this barrier, new models are required. Indeed, due to the 
paucity of structural data available, the design of a complete model 
accounting for all the subtle 3D structural variations observed in 
experimentally determined structures is unlikely. 

The methods developed in this article are based on a recent 
idea suggesting that RNA 3D structures share common structural 
subunits. The decomposition of RNA structures in eleme ntary 
blocks was first introduced by iLemieux and Maiod d2Q02h who 
proposed a description of RNA secondary structures (including 
non-canonical interactions) based on cycles. More recently, the 
analysis of experimental 3D structures revealed that similar 3D 
motifs can be found in multiple unrelated structures. Here, we 
define a motif as a group of nucleotides that adopt a specific 3D 
shape and interaction pattern (including non-canonical interactions). 
Several groups have developed computational methods to extract 
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and cl assify RNA 3D m otifs. The m ost popular databases are 
FR3D dSarver et a/.Ll2008h. RNAmo t i f jpielloul and Deniseil2QQ8h 
and RNAi unction dBindewald et all I2OO8I) . Importantly, these 
databases identify 3D motifs involving three or more segments of 
the same molecule defined as &-way junctions (when the motif is 
the branching point of several helical segments). Such motifs are 
important because they are precisely those shaping the 3D structure 
of an RNA molecule. 

Despite the knowledge accumulated in these databases, the 
integration of this information into current models remains 
complicated. First, the classification of RNA motifs can be 
ambiguous (i.e. a motif and its submotifs can match different 
database entries). Next, the structural compatibility between two or 
more motifs can be difficult to resolve (i.e. how to concatenate two 
motifs). It is worth noting that a method to predict the topological 
family of a given 3 -w ay junction has been recently introduced by 
lLamiable etal\ 1201 2h . 

Interestingly , to co mpl ement the secondary structure programs, 
iMartinez et al\ 120081) and ljossinet et al\ 1201 Oh implemented semi- 
automated methods (resp. RNA2D3D and assemble) for building 
3D models from known/predicted secondary structure information. 
These programs provide intuitive interfaces enabling their users to 
insert 3D motifs and modify backbone angles of a coarse grained 
input structure. 

From this standpoint, the hierarchical approaches (i.e. RNA2D3D 
and assemble) appear well suited to the prediction of large 
RNA structures. Their advantage resides in their capacity to 
benefit from the high accuracy of classical secondary structure 
predictors (i.e. thermodynamic or comparative models) to build a 
scaffold of the structure, and then to leave to the user the task of 
decorating the model with the various structural motifs found in 
databases. Although this strategy is flexib le, it is time-consum i ng and 
requires human participation. Recently, ICruz and WesthoJ l201ll) 
developed RMDetect, a method to predict G-bulge loops, kink- 
turns, C-loops and tandem-GA loops in RNA secondary structures. 
But the prediction of more complex motifs such as the &-way 
junctions and the construction of 3D RNA structures remain open 
problems. 

In this article, we introduce RNA-MoIP, an integer programming 
(IP) framework for inserting RNA 3D motifs inside known (or 
predicted) RNA secondary structures. We use our predictions as a 
template to generate putative RNA 3D structures using the MC-Sym 
software, and show that we are able to predict accurate 3D structures 
of large RNA sequences. Integer programming techniques have 
gained a lot of interest recently as they provided state-of-the-art 
metho ds for predicting RNA secondary st ructures with pseudo- 
knots IPoolsap et fl7ll2009l : ISato et a/.ll201lh . One of their strengths 
resides in their flexibility and capacity to incorporate heterogeneous 
constraints, a valuable advantage when it comes to incorporate 
&-way junctions. 

The article is organized as follows. In Section[2]we formally define 
the motifs, describe our motif database, and introduce our IP model 
RNA-M oIP. In Sectional we appl y RNA -Mo I P on a set of nine RNA 
used bv lLaing and Schlickl 120101) to benchmark RNA 3D structure 
prediction programs. Our results show that RNA-MoIP (i) improves 
the accuracy of canonical base pair predictions; (ii) identifies the best 
secondary structures in a pool of suboptimal structures generated by 
RNAsubopt; and (iii) predicts accurate 3D structures for sequences 
with sizes varying between 53 and 128 nucleotides — an insight that 



cannot be reached by other programs. Finally, in Section |4] we 
discuss our results and propose future research directions. 

2 METHODS 

Let w be a RNA sequence. First, we use a classical secondary structure 
predictor (e.g. RNAsubopt) to generate a list of sub-optimal secondary 
structures. Second, for each structure from the list we use RNA-MoIP 
to insert RNA 3D motifs in the structure using the sequence information 
provided by go. RNA-MoIP works in two steps: 

(1) Given a database of sequences of RNA 3D motifs (cf. Section IZ21 . 
the preprocessing step applies a classical pattern matching algorithm 
to find all occurrences of each motif in the input sequence go. 

(2) Given this list of potential insertion sites and a secondary structure, 
we solve an IP problem which minimize our objective function 
(cf. Section |23}. Importantly, under certain conditions, RNA-MoIP 
allows base pair removals to insert the 3D motifs. 

Fi nally, we use the best solutions as templates for MC-Svm ( Parisien and 
Maior j2008l) and generate 3D structures. In particular, we constrain MC - Sym 
to use the motifs inserted by RNA-MoIP at their predicted location. As we 
will see later, these constraints enable us to produce 3D structures, when an 
unconstrained run would simply never end. 

2.1 Definitions 

Motif: We represent a motif x as an ordered list of components (i.e. 
sequences) where x 1 i is the i-th nucleotide of the j'-th component (i.e 
sequence). As presented in Figure ^ hairpins have one component, bulges 
and internal loops have two and ^-way junctions have k. Let r be the number 
of components, we represent a motif as x:= [(jcJ , • • • , ),•••, (x[ , • • • , x r k )] 

and x 1 i e{A,U ,G, C, *} where * represents a wildcard. We also write a motif 
x as M x :=x\ • • -x^ — x\ • • — x\ • • -x r kr , i.e. the concatenations of its letters 
with the added character '— ' between the components. We define \M X \ as 
the number of nucleotides in x. 

Match: Given a sequence coe{A,U ,G,C} + , and a motif x with r 
components, we say that goi is the i-th character of go, and that a motif 
x match the sequence go at (p\ , • • • ,p r ) if V 1 < i < r : pi + 5 < pi + \ and 



We/i k} '• ^i—^pj+i-i where the p z -'s indicate the first positions of the i-th 
component of motif x in go. The inequality ensures that each component is 
separated by at least five nucleotides. 

2.2 RNA motifs database 

Here we describe how we build the motifs database. First, we retrieve 
888 e xperimentally determi ned RNA 3D structures from the Protein Data 
Bank iBerman et all l2000l: www . pdb . org] . Th en, we use the program 
RNA3Dmotif s from iDielloul and DeniselEo08l) to extract all the motifs 
from these structures. This results in a dataset of 35 724 motifs for which 
we have a 3D pdb file and a description of the interactions. 

We processed these data to create a non-redundant database of curated 
motifs. To ensure the compactness and coherency of the motifs, we assume 
that each component is at least five nucleotides farther than the previous one; 
otherwise the nucleotides are merged in a single component and the missing 
positions are replaced by a wildcard V (See Section IzTV We describe a 
motif m returned by RNA3Dmotif s as m:={(m\,pi), ,(m n ,p n )}, where 
mk e {A, U, G, C} and pk < Pk + i e N is the position of nucleotide m\ in the 
sequence it was extracted from. We create x such that if we set x , i =mj ( and 
1 <Pk+i —Pk = a <5 then Vi < i' < i + a : x 1 ^ = * and *J +a =m^+i . If Pk+i — 
Pk>$ then x 7 ^ 1 =mj i+ \ . 

Some motifs may have small components composed of one or two 
nucleotides. In our framework, the insertion of these components will be 
less constrained by the secondary structure and thus less specific. To avoid 
this case, we extend all small components in all possible combinations with 
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Fig. 1. Th is is an example of motifs extracted by Rna3Dmotif r Dielloul 
and DeniseT2008|n^n^U^iven^W these 
as input it defines the following. The hairpins form the group with one 
component and we write: A:=[(GGAAAC)], B:= [(CGAAAG)]]. Interior 
loops, and Bulges, have two components (e.g. C := [(GAU), (AG AU GC)]). 
The Ti-way junctions naturally have n components. In this case there is a 3- 
way junction which can be written in our framework in two ways: D:= 
[(CGAA),(UGU AAC),(GG*)] or D:=[(CGAA),(UGU AAC),(*GG)], 
since we want components to be of size at least 3. D can also be written as 
M D :=CGAA-UGU AAC — GG* (resp. CGAA-UGU AAC-*GG) and 
we can say that D match this sequence at (6,35,47) but also, (6,35,48) and 
many other positions. A motif can be inserted multiple times 

the character * until they reach a size of three (e.g. the last component of 
motif D in Figure [TJ. 

It is worth noting that all these constraints are empirical rules which aim to 
remove discrepancies and unify the sequence constraints applied on motifs. 
They should not be considered as a rigid framework but rather as a tentative 
to clarify the RNA3Dmotif s output. 

Finally, we cluster together all pairs of motifs x,y if M x =M y (i.e. the 
sequences are identical) to obtain a non-sequence-redundant database of 4708 
motifs. 

It is important to note that in our database, the motifs with one single 
component are all hairpins and do not include bulges. In this article, bulges 
will be seen as a particular case of interior loops since for the motif to loop, 
it needs to include the complementary strand. 

Finally, we excluded from this database the structures used in the 
benchmark (see Section lT2l . 

2.3 IP model 

Here, we describe the IP equations used to insert the motifs into a given 
secondary structure. To insert a motif into the structure, our model allows 
some base pairs to be removed. 

2.3. 1 Input: We introduce the notations and sets that will be used to model 
our input data. Let be a RNA sequence, and S a secondary structure of co 
without pseudo-knots. We denote by n= \co\ the length of co, and by 8 the 



maximum percentage of base pairs that is allowed to be removed. We call B 
the set of base pairs found in the secondary structure S. We denote by Mot- 7 
the set of motifs with j components that match co: 

Mot- 7 = {x|jc :=[(*},••• ,xl ),---,(j^,---,4 )] and 

(1) 

3 a match of x in co} 

We store in Seq^ the positions where the i-th component of the motifs of 
order j can be inserted: 

Seq- 7 - = {{x,pi,pi +ki — l)\xe Mot- 7 and 

(2) 

3 a match (p\, •• • ,Pi-i,Pi,Pi+i, ••• ,Pj) of x in co} 

We note that the criteria used to determine whether a motif can be inserted 
is based on the sequence only. At this stage, the secondary structure S is not 
used. 

2.3.2 Variables: We now describe the two variables used in our model. 
Our program will make two predictions: first, it finds the location of the 
insertion sites of the motifs, and second, it predicts which base pairs are 
removed. We denote C£*J as the boolean variable indicating the insertion of 
the j-th component of the motif x between positions k and / in co. Similarly, we 
use the boolean variable D u v to indicate if the base pair (u, v)eBis removed 
or not (i.e. D M)V = 1 if (u,v) is removed from the secondary structure S and 
0 otherwise). 

2.3.3 Objective function: We describe here the optimization criteria that 
will be used to predict the insertion of the RNA motifs. As mentioned earlier, 
we do not have any estimate of the energy of the motifs retrieved with 
RNA3Dmotif s. Instead, we will use a principle of minimum entropy. We 
assume that a molecule folds in a configuration that stabilizes its backbone 
and side chains through various base pairings. In other words, we aim to 
minimize the free variables of the molecule. In the absence of reliable 
energy values, we assign to the motifs a weight equivalent to the square 
of the number of nucleotides in its components. This objective function 
aims to increase the coherency of the motif insertions as it maximizes 
the nucleotide positions coverage and favours the insertion of large motifs 
instead multiple small ones. It also ease the 3D reconstruction process with 
MC-Sym. Although this objective function is purely heuristic, it performed 
well in this work. We give a penalty of 10 for every base pair deleted (other 
values have been tried with similar results, data not shown). Formally, we 
aim to minimize the following function: 

io* E E O^D 2 - E c u) (3) 

2.3.4 Constraints: Here, we describe the constraints that we use to ensure 
the correctness of the motif insertion and to control the coherency of the final 
structure. We detail these equations below. 

Hairpins: 

\ M ^ q \--C x k j< £ (i-A,,v)+ 

(u,v)eB 
k-l<u<kAl<v<l+l 

E c u+ £ <t? 

(jc,JEJ)eSeqf (x,kj)eSeqj 
l=k-\ k=l+l 

We use Constraint {4) to insert the hairpins (i.e. x e Mot 1 ). A hairpin can 
be inserted if and only if one of two following criteria holds: A base pair 
(u, v) e B exists such that both extremities are stacked or overlap on the motif 
x (Fig.(2^), or there is an inserted motif y with two components (i.e. y e Mot 2 ) 
such that x is nested inside y and stacked onto one of its components (Fig.0>). 
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(a) 




Fig. 2. {2^) shows how a hairpin can be inserted in our model, with both extremities stacked, or overlapping a canonical base pair. In jlj)) we show the last 
configuration, where we require the hairpin to be between the components of a 2-way junction and to have at least one extremity stacked. In j2j) we show 
another view of Figure ^ without the base pairs overlapping the 3-way junction. We can notice that we can go from any component of the 3-way junction to 
any other without crossing the base pairs 



Interior loops and bulges: 



V(m, v) eB,Wxe Mot 2 : -n-D UtV < 



E 



E 



C x k j<n-D u , 



(5) 



(jc,]U)€Seqf 
l<uVv<k 



(x,k,l)eSeq2 
l<uvv<k 



V(jc,M)€Seqf,V(jc,JU)| 



£>/a2- 



*D+ E ^- 

(u,v)eB (u,v)gB 
k<u<lAk<v<l k<u<l®k<v<l 



■k + l-k+1 



(6) 

Constraints J5j and <|6j are used to insert bulges and interior loops. 
Constraint J5j stipulates that for all base pairs (u,v)eB, every motif in 
Mot 2 must have as many first component inserted before u or after v, as it 
has second components, allowing to create an arc between the components 
of every motif without creating a pseudo-knot with the base pairs in the 
secondary structure. Constraint 0 allows both components to be inserted 
only if they fill at least 2 unpaired positions. Indeed, such insertion would 
most likely not produce valuable structural information. 



k-way junctions: 



E 



(7) 



^ 3 

(x,k,l)eSeq{ 



Y/>3,V(w,v)efl: -n-D u , v < 



o--D- E 



(x,k,l)eSeq\ 
u<k<l<v 



- E qf:^*-Ai,v 
\<i<j 

(x,k,l)zSeq{ 



(8) 



Constraints {7J and |8j describe how &-way junctions are inserted. 
Constraint 0 restricts the number of inserted motifs with three or more 
components to one, which is a reasonable assumption given the size of 
the RNAs. Combined with J8), it means that for every conserved base pair 
(u,v)eB, a motif can be inserted if all or none of the components are 
between u and v. This is equivalent, as we can see in Figure |2j;, to saying 
that we can connect the components which are shown in red without creating 
a pseudo-knot with the base pairs in the secondary structure. 



Motifs completeness: 

V 1 < i < j, V(jc, k,l)e Seq-J : C*j < ^ C~f 



Oc,M)eSeq^. +1 
l+5<k 



V 1 < i <j, V(x, k,l)e Seq{ : C£j < C~f 



(9) 



(10) 



(11) 



(x,A;,/)eSeq^_ 1 
~l<k-5 

V/>1, VxeMotA VI <*</: ^ C^'J- ^ c o = ° 

(x,A:,/)eSeq J j (x,fc,7)eSeq] 

Constraints lf9l . (TTol and fTB ensure that the insertions of the components 
in co respect their order given in the motif. Constraints J9) and flol require 
that if C^'-J is the j-th component of motif x and it is inserted at positions 
k,l, then at least one (j — l)-th component of the same motif should be 
inserted five nucleotides above, and one (/ + l)-th component after. The 
last constraint restricts that, since a motif can be inserted many times, the 
multiplicity of every component should be equal to the multiplicity of the 
first component. 

Secondary structure constraints: 

Y/ > 1, VI < i <j,V(x,k,I) e Seq-j : C*j < J}l -D u , v ) (12) 

(u,v)gB 
k—l<u<kv 
l<u<l+\v 
k-l<v<kv 
l<v<l+l 

Vl<u<n: £ C U + J E ^-Du)+\ £ C W ^ 1 



Oa,/)eSeq^ 
k<u<l 



(k,l)GB 
k=uvl=u 



(x,k,l)eSeq J f 
k=uvl=u 



y\<u<n:(l- 



J2 D ~u,v)-iX~ J2 D ~ u ^+ 



(u,v)gB 
u=u—lvv=u—l 



(u,v)gB 
u=uvv=u 



J2 D u,v)>0 



(u,v)eB 
u=u+lvv=u+l 



£ av<*-i*i 



(13) 



(14) 



(15) 



We conclude by describing the constraints regulating the secondary 
structure properties. Constraint (\2\ use the secondary structure to guide the 
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sites of the components by allowing insertions if and only if one extremity 
overlaps or is stacked on top of a base pair. Constraint fL3l forbids two 
components from overlapping to each other, and prevents base pairs to 
occurinside^^omDor^ 

et fl/. j2QQ9h to prevent lonely base pairs (i.e. every position in a base pair 
must also have an adjacent position in a base pair). Constraint HTi limits 
the number of canonical base pairs 8 of S that can be removed. 

3 RESULTS 

3.1 Implementation 

To sol ve the IP problem, we use the Gurobi optimizer 

v.4.5. 1 faoustonlEoill) API for Python. We ran our benchmark on a 
Ubuntu-Server 10.04 on a Dell PE T610 2x Intel Quad core X5570 
Xeon Processor, 2.93 GHz 8 M Cache, 64 GB Memory (8x8 GB), 
1333 MHz Dual Ranked RDIMMs for 15 Processors, Advanced 
ECC. 

3.2 Dataset 

We va lidate our method on the dataset defined by (lLaing and SchlickL 
l2010h to benchmark the RNA 3D structure prediction programs. 
In this work, we aim to predict the structure of large RNAs with 
3-way and 4-way junctions. Small sequences (<50 nucleotides) 
with simpler structures can be accurately predicted using existing 
methods such as MC- Pipeline or NAST. Therefore, we removed 
from the dataset sequences with <50 nucleotides. We also 
removed RNAs with secondary structures that include pseudo- 
knots. Indeed, our approach has been designed to use secondary 
structures predicted by classical secondary structure predictors 
such as RNAf old and RNAs true ture, thus without pseudo- 
knots. Moreover, our motif database and IP model have not been 
designed to insert pseudo-knots. We redirect the reader interested in 
application of I P techniques to the p redi ction of pseudo-k nots to the 
recent works of lPoolsap etol\ d2QQ9h and lSato etol\ 1201 lh . Our final 
dataset includes 11 RNAs with sequences of lengths ranging from 
53 to 128 nucleotides. We note that 2 of these 11 had no homologous 
3-way junctions in our database. We present here the results on the 
remaining nine RNAs. Eight of them have a 3-way junction and the 
other a 4-way junction. Imp ortantly, the motifs extracted from these 
RNAs by RNA3Dmotif s (iDielloul and Denisei l2008h have been 
removed from our motif database. 

For the negative control test, we used a test set composed of the 
24 RNAs from the dataset defined by C.Laing and T.Schlick without 
pseudo-knot, 3 or 4-way junction. Their sizes range from 16 to 77 
nucleotides. 

3.3 RNA-MolP pipe-line 

Our RNA tertiary structure prediction pipe-line works in three steps. 
First, secondary structures are predicted using classical predictors 
such as RNAf old, RNAstructure or unaf old. In this work, we 
gene rated the input secondary structures with RNAsubopt f W uchty 
We used the default parameters but discarded structures 
with lonely pairs (i.e stems of length 1). This procedure generated 
between 1 and 22 secondary structures for each RNA sequence. 
Nonetheless, the quality of secondary structure predictions is too 
low on the riboswitch 3D2G from Arabidopsis thaliana and the 
tRNA 2DU3 from Archaeo globus fulgidus to accommodate k- 
way junction motifs insertion. Therefore, we extended our list of 



suboptimal structures and generated all secondary structures in the 
range of 4.5 kcal/mol from the mfe. This operation resulted in a total 
of 242 (resp. 58) secondary structures. We also note that extending 
the list of suboptimal structures of other RNAs produces identical 
results. Typically the secondary structure predictors generate lists of 
suboptimal structures from which it is difficult to extract the best 
ones. We will see in Section l3".3.2l that our method is able to identify 
the best candidates in these ensemble predictions. 

We apply RNA-MolP to insert RNA 3D motifs in these secondary 
structures as described in Section [231 The solution with an optimal 
score, under our objective function (Section 12.3.31 ) is scripted 
manually for MC-Sym with the motifs locked in. Due to various 
MC-Sym features, it is currently difficult to generate automatically 
these scripts. Hence, the processing of very large sequence datasets 
remains challenging. We recall that many 3D structures can have 
the same motif, which is only determined by the sequence. Here 
we provide all alternative configurations to MC-Sym. Time is a 
major limitation of MC-Sym. Using our strategy, we show that 
preprocessing the sequences with RNA-MolP results in a dramatic 
time improvement and at the same time improves the accuracy. We 
set a time limit of 30 min. Then on every set of predicted structures 
we apply a minimization of steepest-descent until [(G RMS <5 
Kcal/mol/A) or (steps > 500)] dParisien and Maiorllioojl) . It is worth 
noting that MC-Sym was not able to generate a structure in two 
cases (3E5C and 2GDI), although RNA-MolP predicted the 3-way 
junctions at the correct positions. 

3.3.1 Negative control We verify that RNA-MolP does not 
predict wrong insertions (i.e. false positives). We use a negative 
control dataset composed of the 24 RNAs extracted from the dataset 
of C.Laing and T.Schlick that contain hairpins and interior loops 
motifs but without pseudo-knots and &-way junctions. Then, we 
apply the protocol described as in Section [331 Our results indicate 
that no &-way junction have been inserted in the optimal solution 
returned by RNA-MolP. 

3.3.2 Secondary structure The identification of the best 
secondary structures in a list of suboptimals is one major challenges 
in RNA secondary structure ensemble prediction. As we can see in 
Table \T\ the average base pair accuracy of the secondary structure 
prediction is ~63%. But when we look at the base pair accuracy of 
the secondary structures selected by RNA-MolP (78%) we observe a 
major improvement of 15% which means that our approach is able 
to identify the best secondary structures in a pool of candidates. 
Interestingly, our program is able to extract candidates with a very 
low rank. For instance, on 2DU3 RNA-MolP extracts the 163-th 
candidate with a base pair accuracy of 91% (versus 43% in average) 
in a pool of 258 structures. Finally, to accommodate motif insertions 
RNA-MolP can remove base pairs. Once removed, the ratio of 
well-predicted base pairs reaches 84%, thus increases by 6%. This 
experiment demonstrates that the insertion of motifs can help to 
identify incorrectly predicted base pairs. 

3.3.3 Three-dimensional structure We evaluate the quality of 
our 3D structure prediction s using the RMSD and the RNA 
Interaction Network Fidelity dGendron et all l200lh tool, available 
with the MC-Pipeline atmaj or . iric . ca/MC-Pipeline/. 
The latter computes the true positive (TP), false positive (FP) 
and false negative (FN) tertiary structure interactions between the 
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Table 1. 



Percentage of well-predicted base pairs 
in the predicted secondary structures 



Secondary structure 
selected by RNA-MoIP 



PDB 


Optimal solution 


Average over all 


Rank in 


Nb. of candidate 








spfnnHarv slriirtiirps 


R1\TA cn iViottI - list 


^ppnnH^rv <i1riirtiirp<i 




Before 


After 








3E5C 


100 


100 


100 


1 


2 


1DK1 


88 


92 


82 


1 


7 


1MMS 


47 


67 


49 


2 


2 


2DU3 


79 


100 


44 


52 


58 


3D2G 


91 


100 


43 


163 


243 


2HOJ 


68 


68 


61 


13 


20 


2GDI 


96 


94 


71 


10 


22 


1LNG 


100 


100 


82 


1 


7 


1MFQ 


29 


31 


31 


1 


4 


Average 


78 


84 


63 







The first column shows the PDB identifier of the RNAs. The two following columns show the ratio of well-predicted base pairs in two structures. The former is the structure in the 
optimal solution of the RNA and the latter is the same structure after RNA-MoIP was applied and removed some base pairs (we highlight in bold the improved scores). There is 
an average increase of 6% and only one case where it decreased. The fourth column represents the average of well-predicted base pairs for each RNA over all secondary structures 
considered by RNA-MoIP. The penultimate column shows the rank of the best secondary structure selected by RNA-MoIP in the ordered list of suboptimal secondary structures 
generated with RNAsubopt, whereas the last column shows the total number of suboptimal secondary structures in that list. 
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Fig. 3. J3fr) shows the PPV and STY for all the 3D structures generated by our scripts on MC-Sym against the reference on the PDB jBerman et a/.Ll200Qb . 
0)) Shows in green the distribution of the RMSD of the solutions obtained with MC - S ym when the structures of the motifs were given and in blue when 
only the secondary structure was provided. N.B.: In the latter, the molecules are identified by their size 



experimental structures deposited in the PDB jBerman et q/.ll200()l) 
and our predictions, and returns the positive predictive value (PPV) 
and sensitivity (STY) defined as: 

|TP| |TP| 



PPV:= 



STY: 



|TP| + |FP| |TP| + |7W| 

We report our results in Figure|5] Figure|5]3 shows that RNA-MoIP 
coupled with MC - Sym is able to predict most of the tertiary structure 
interactions. 

We show in green in Figure the RMSDs of the solutions 
obtained with MC - Sym as described in Section 13.31 We recall that 
each script for MC - Sym is done manually with the positions inside 
the predicted motifs directly mapped to the pool of corresponding 



3D st ructures, obtained by RNA3Dmotifs from (Djello ul and 
Denise J2008l) . We also recall that ( lLaing and SchlickLl20ld) report 
that only the two smallest structures were resolved by MC-Fold 
|MC-Sym when only the sequence was given. We thus decided to 
input into the MC-Fold | MC-Sym pipeline the sequence with the 
secondary structure selected by RNA-MoIP. Under this scenario, 
MC-Sym was allowed to run for 48 h. Those results are shown in 
blue. As we can see, having the secondary structures allowed to 
solve five of the seven structures. We note that two of them only 
produced seven solutions in the first half-hour. The largest one 
took more then 4 h to produce the first results, and had only two 
solutions after the 48 h. Nonetheless the information given by the 
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Table 2. 
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3 -way 
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71 
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139.96 


7 


0.90 
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0.04 
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2.91 


0.44 


4-way (tRNA) 
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243 
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3 -way (ribos witch) 
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3 -way (ribos witch) 
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22 


47.1 
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110.96 


146 


0.88 


0.84 


0.02 


2.73 
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1.91 


3-way (SRP) 


1MFQ* 


128 
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46.06 


14 


0.77 


0.72 


0.03 


9.07 


14.34 


5.01 
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In the first column RNA identifiers are followed by an '*' or a '#' to denote that MC-Sym (reps. NAST) failed to predict them, as reported bv lLaing and Schlicld Eoicb . The second 
column contains the length of each RNA. The third column contains the number of secondary structure predicted by RNAf old and used as input for RNA-MoIP. The fourth column 
is the total time (preprocessing and solve) in seconds taken by RNA-MoIP to find an optimal solution for all the secondary structures. The fifth column contains the number of 3D 
structures generated by MC-Sym with the script made with RNA-MoIP optimal solution. We then have the maximal, average and SD of the MCC. The following three columns 
present the minimal, average and SD of the RMSD. Finally, the last column indicate the type of junction found in the native structure. 
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Fig. 4. Comparison of the RMSD obtaine d by RNA -MoIP 



MC-Pipeline, iFoldRNA and FARNA 



by |La 



and Schlick 



' 201C ). This figure is derived from the data computed bv lLaing and Schlick 
i 20101) on which we superposed the results obtained by RNA-MoIP and 
MC-Sym. The dots are the average RMSD shown in Figure (2). We also 
show in dotted line the extrapolated RMSD for MC-Sym and in black the 
best fit for the average RMSD obtained with our pipeline 

motifs allows to our method to predict significantly more accurate 
results. 

Figure |4] shows that our program outperforms other software and 
produces 3D structures with a RMSD s ignificantly lower than those 
observed by lLaing and Schlickl bOlOh for other programs. It also 
shows that our method scales with the length of the RNA better than 
other approaches. 

We completed this analysis by computing the Matthews 
correlation coefficient (MCC), defined as: MCC := VPPV x STY, 
and the running time of our method. We show in Tableman overview 
of the results obtained on each RNA. We note the fast execution time 
of RNA-MoIP, even when a large number of secondary structures 
are used. Also, despite a time limit of 30 min, MC-Sym generates 
good candidates. Noticeably, our RMSD can be as low as 2.23 A for 



the tRNA 2DU3 and are con siderably smaller than those reported 
by (lLaing and Schlickl 120 id) (Fig. HI. 



4 DISCUSSION 

In this article we demonstrated that large RNA 3D structures can be 
automatically predicted using a hierarchical approach. We benefits of 
the progresses accumulated over the last 30 years in the field of RNA 
secondary structure prediction and, using an IP framework, expands 
these methods to incorporate the novel local motifs information 
available in databases. We show that this approach enables us to 
predict very quickly accurate 3D structures for large RNA sequences 
(>50 nucleotides). By contrast, previous methods were either too 
slow or too inaccurate on molecules with similar sizes. 

We show that motif insertion enables us to identify the best 
secondary structures in a pool of suboptimal structures. Nonetheless, 
the choice of the size of the sample set that need to be generated 
remains an open problem. As illustrated by the 3D2G and 2DU3 
experiments, some RNAs may require significantly more suboptimal 
than those generated by default by RNAsubopt. A simple strategy 
to reduce the search space would be to cluster those samples and 
pick representatives structures. 

RNA-MoIP demonstrates that we can already benefit from the 
information accumulated in RNA local motif databases without 
deriving a new model. This is important because the paucity 
of the data currently available in these databases prevents us to 
develop accurate statistical potentials for predicting tertiary structure 
interactions in high-order motifs such as the &-way junctions. 

Therefore, another important issue with RNA-MoIP is the 
completeness of the motif database. For instance, we have seen that 
there is no homologous 3-way junction in our database that can be 
correctly inserted in 3EGZ (riboswitch in H. sapiens) and 20IU 
(synthetic ribozyme). To circumvent this limitation, an interesting 
approach would be to generate highly pr obable new motifs from th e 
existing ones using isostericity matrices dStombaugh et al 1[2003). 

Some of the IP techniques developed here could be re-written 
using a dynamic programming approach. However, we argue that 
the IP approach is more flexible and more suited to this problem. 
Indeed, in our framework the rules of insertions can be easily 
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modified (i.e. adding, removing or changing an equation) whereas 
a dynamic programming scheme would required a complete re- 
implementation. This is particularly useful in this case where some 
motifs present in our databases have specific insertion constraints. 
This situation is more likely to happen in the future with the 
growth of RNA local motif databases. Moreover, we demonstrated 
in this article that our implementation is fast enough for realistic 
applications. 

Finally, our methods are compatib le with state-of-the-a rt IP 
prog rams for pseudo-knot predictions jPoolsap et all 120091; Sato 
et al, 120111) . In future work, we could envision to merge the two 
models and include new rules for inserting highly sophisticated 
3D motifs with long-range interactions, coaxial stacking or base 
triplets. Beside its inherent flexibility, the development of IP models 
for RNA structure prediction finds a justification in recent results 
showing the inapproximability of the prediction of RNA pseudo- 
knot ted secondary structures with a nearest neighbour model ( Sheik 
et q/.. l2012h . ~ " ~ 
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